Normative: Update String toLocale{Lower,Upper}Case to ResolveLocale with best-fit matching #956

Open
wants to merge 4 commits into
base: main

gibson042

Fixes #896

  • Do not ignore locales after the first in the list returned by CanonicalizeLocaleList(locales) (observable via e.g. "I".toLocaleLowerCase(["zzz", "tr"]) === "ı").
  • Match against the items of that list by best-fit rather than prefix, aligning with the rest of ECMA-402 (although the difference is not necessarily observable).
  • When the list is not empty but no matching locale is found, default to DefaultLocale() rather than "und", aligning with empty-list behavior and with the rest of ECMA-402 (observable via e.g. "I".toLocaleLowerCase("zzz") === "I".toLocaleLowerCase() regardless of default locale, just like new Intl.Collator("zzz", { sensitivity: "variant" }).compare("i", "ı") === new Intl.Collator(undefined, { sensitivity: "variant" }).compare("i", "ı")).
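
The three changes listed above can be modeled with a small sketch. This is not spec text: `bestAvailableLocale` follows the prefix-truncation idea of ECMA-402's lookup matching, `resolveCurrent` and `resolveProposed` are hypothetical names, and the available-locales list and default locales are assumptions chosen purely for illustration.

```javascript
// Hypothetical model of locale resolution for toLocale{Lower,Upper}Case.
// Names and locale lists are illustrative assumptions, not spec text.

// Prefix lookup: drop subtags from the right until a match is found.
function bestAvailableLocale(availableLocales, locale) {
  let candidate = locale;
  while (true) {
    if (availableLocales.includes(candidate)) return candidate;
    let pos = candidate.lastIndexOf("-");
    if (pos === -1) return undefined;
    // Also drop a dangling singleton subtag such as "-u" or "-x".
    if (pos >= 2 && candidate[pos - 2] === "-") pos -= 2;
    candidate = candidate.slice(0, pos);
  }
}

// Behavior before this PR: only the first requested locale is considered,
// and a failed match falls back to "und".
function resolveCurrent(requested, availableLocales, defaultLocale) {
  const first = requested.length > 0 ? requested[0] : defaultLocale;
  return bestAvailableLocale(availableLocales, first) ?? "und";
}

// Proposed behavior: every requested locale is tried in order,
// and a failed match falls back to the default locale.
function resolveProposed(requested, availableLocales, defaultLocale) {
  for (const locale of requested) {
    const match = bestAvailableLocale(availableLocales, locale);
    if (match !== undefined) return match;
  }
  return defaultLocale;
}

// Assumed list: only locales with language-sensitive case mappings.
const available = ["az", "lt", "tr"];

console.log(resolveCurrent(["zzz", "tr"], available, "en-US"));  // "und"
console.log(resolveProposed(["zzz", "tr"], available, "en-US")); // "tr"
console.log(resolveCurrent(["zzz"], available, "en-US"));        // "und"
console.log(resolveProposed(["zzz"], available, "en-US"));       // "en-US"
```

With this assumed list, the model also returns "tr" for `resolveProposed(["en", "tr"], available, "en-US")`, so the outcome depends heavily on which locales are treated as available.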

anba commented Jan 30, 2025

  • Do not ignore locales after the first in the list returned by CanonicalizeLocaleList(locales) (observable via e.g. "I".toLocaleLowerCase(["zzz", "tr"]) === "ı").

But also "I".toLocaleLowerCase(["en", "tr"]) === "ı", because "en" generally won't have locale-sensitive case mappings, so the next locale in the list gets selected.

  • Match against the items of that list by best-fit rather than prefix, aligning with the rest of ECMA-402 (although the difference is not necessarily observable).

"best fit" matching isn't supported in browsers. (V8 has "--harmony-intl-best-fit-matcher", but that's not available by default.)

  • When the list is not empty but no matching locale is found, default to DefaultLocale() rather than "und", aligning with empty-list behavior and with the rest of ECMA-402 [...]

That means "I".toLocaleLowerCase("und") can now return either "i" or "ı", depending on the user-locale.

@gibson042

  • Do not ignore locales after the first in the list returned by CanonicalizeLocaleList(locales) (observable via e.g. "I".toLocaleLowerCase(["zzz", "tr"]) === "ı").

But also "I".toLocaleLowerCase(["en", "tr"]) === "ı", because "en" generally won't have locale-sensitive case mappings, so the next locale in the list gets selected.

Ah yeah, I guess the Available Locales List needs to include more than just locale identifiers with language-sensitive case mappings. Any thoughts on what it should be? Maybe %Intl.Collator%.[[SortLocaleData]]?

  • Match against the items of that list by best-fit rather than prefix, aligning with the rest of ECMA-402 (although the difference is not necessarily observable).

"best fit" matching isn't supported in browsers. (V8 has "--harmony-intl-best-fit-matcher", but that's not available by default.)

That's irrelevant, because LookupMatchingLocaleByBestFit is defined to produce results "at least as good as those produced by the LookupMatchingLocaleByPrefix algorithm" (and therefore any implementation is free to just reuse LookupMatchingLocaleByPrefix).

  • When the list is not empty but no matching locale is found, default to DefaultLocale() rather than "und", aligning with empty-list behavior and with the rest of ECMA-402 [...]

That means "I".toLocaleLowerCase("und") can now return either "i" or "ı", depending on the user-locale.

I believe that would also be addressed via the Available Locales List provided to ResolveLocale, but regardless should align with other Intl services in general and Intl.Collator in particular. Probably, "und" should just always be considered available, but definitely should not be given special treatment exclusively in TransformCase.

anba commented Jan 31, 2025

Ah yeah, I guess the Available Locales List needs to include more than just locale identifiers with language-sensitive case mappings. Any thoughts on what it should be? Maybe %Intl.Collator%.[[SortLocaleData]]?

If I had to guess, I'd say one of the reasons locale case conversion works differently from the other APIs is that it's difficult to find an appropriate Available Locales list. I'm not sure if Intl.Collator is a good fit.

That's irrelevant, because LookupMatchingLocaleByBestFit is defined to produce results "at least as good as those produced by the LookupMatchingLocaleByPrefix algorithm" (and therefore any implementation is free to just reuse LookupMatchingLocaleByPrefix).

I had assumed "the difference is not necessarily observable" was in reference to actual browser behaviour. If we assume an implementation that supports "best fit", which most likely uses the data from https://github.com/unicode-org/cldr/blob/main/common/supplemental/languageInfo.xml, then it's possible to have observable differences. There are three relevant entries:

<languageMatch desired="ku" supported="tr" distance="30" oneway="true"/>
<languageMatch desired="azb" supported="az" distance="10" oneway="true"/>
<languageMatch desired="az" supported="ru" distance="30" oneway="true"/>

That means "I".toLocaleLowerCase("ku") may fall back to "I".toLocaleLowerCase("tr"), because there's the fallback "ku" → "tr". For example, V8 doesn't ship locale data for Kurdish (Intl.Collator.supportedLocalesOf("ku") returns the empty array), so if V8 started to officially support the "best fit" matcher and string case conversion were tied to the Intl.Collator Available Locales, then "I".toLocaleLowerCase("ku") could start to return the dotless i (U+0131).
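
To make the concern concrete, here is a minimal sketch of a hypothetical best-fit matcher that tries a prefix match first and then consults one-way fallbacks like the languageMatch entries quoted above. The `fallbacks` table, the function names, and the available-locales list are assumptions for illustration only; CLDR's actual matching algorithm is considerably more involved.

```javascript
// Assumed one-way fallbacks, copied from the three languageMatch entries above.
const fallbacks = new Map([
  ["ku", "tr"],
  ["azb", "az"],
  ["az", "ru"],
]);

// Simplified prefix lookup: drop subtags from the right until a match is found.
// (Singleton subtag handling is omitted for brevity.)
function lookupByPrefix(availableLocales, locale) {
  let candidate = locale;
  while (candidate !== "") {
    if (availableLocales.includes(candidate)) return candidate;
    candidate = candidate.slice(0, Math.max(candidate.lastIndexOf("-"), 0));
  }
  return undefined;
}

// Hypothetical best-fit lookup: never worse than prefix matching,
// but may additionally follow a language-level fallback.
function lookupByBestFit(availableLocales, locale) {
  const direct = lookupByPrefix(availableLocales, locale);
  if (direct !== undefined) return direct;
  const language = locale.split("-")[0];
  const fallback = fallbacks.get(language);
  return fallback === undefined
    ? undefined
    : lookupByPrefix(availableLocales, fallback);
}

// Assumed list with no Kurdish locale data.
const available = ["az", "ru", "tr"];

console.log(lookupByPrefix(available, "ku"));   // undefined
console.log(lookupByBestFit(available, "ku"));  // "tr"
console.log(lookupByBestFit(available, "azb")); // "az"
```

Under these assumptions, a request for "ku" resolves to "tr" only when the best-fit path is taken, which is where an observable difference from prefix matching could come from.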

I believe that would also be addressed via the Available Locales List provided to ResolveLocale, but regardless should align with other Intl services in general and Intl.Collator in particular. Probably, "und" should just always be considered available, but definitely should not be given special treatment exclusively in TransformCase.

I gave "und" as a special case, because at least for programmers with a Java background, using "und" shouldn't be too uncommon. (Java's String case conversion methods use java.util.Locale.getDefault() by default, which can result in bugs when the default locale is Turkish/Azeri. Instead it's necessary to use str.toLowerCase(Locale.ROOT).)
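
For comparison with the Java pitfall, here is the JavaScript picture in engines as they behave today, before this PR: toLowerCase() is locale-independent, while toLocaleLowerCase() with an explicit Turkish tag applies the dotless-i mapping. The results noted in the comments assume an engine with ICU-backed case mappings (for example a default Node.js build), and the "und" result reflects current behavior rather than what this PR proposes.

```javascript
const s = "TITLE";

// Locale-independent mapping, comparable in spirit to Locale.ROOT in Java.
const rootLower = s.toLowerCase();

// Explicit Turkish locale: "I" maps to dotless "ı" (U+0131).
const turkishLower = s.toLocaleLowerCase("tr");

// Explicit "und": treated as the root locale in current engines.
const undLower = s.toLocaleLowerCase("und");

console.log(rootLower);    // "title"
console.log(turkishLower); // "t\u0131tle"
console.log(undLower);     // "title"
```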

@gibson042

Updated per TG2 discussion.

Status: Previously Discussed
Successfully merging this pull request may close these issues.

toLocaleLowerCase/toLocaleUpperCase should better align with the rest of ECMA-402